This notebook documents the steps taken to solve the Predict Students’ Ability to Repay Educational Loans challenge posted on the Data Science Community in Coursera.
The data is available at:
https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv.
Documentation for the data is available at https://collegescorecard.ed.gov/data/documentation/. There is a data dictionary at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx.
The workflow suggested in https://www.kaggle.com/startupsci/titanic-data-science-solutions is going to be followed. The workflow is the following:
Question or problem definition.
Acquire training and testing data.
Wrangle, prepare, cleanse the data.
Analyze, identify patterns, and explore the data.
Model, predict and solve the problem.
Visualize, report, and present the problem solving steps and final solution.
Supply the results.
The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions:
We may combine multiple workflow stages. We may analyze by visualizing data.
Perform a stage earlier than indicated. We may analyze data before and after wrangling.
Perform a stage multiple times in our workflow. The visualize stage may be used multiple times.
Drop a stage altogether. We may not need the supply stage to productize or service-enable our dataset for a competition.
Test whether a set of institutional features can be used to predict student outcomes, in particular debt repayment. This solution explores to what extent institutional characteristics, as well as certain demographic factors, can indicate or predict debt repayment.
The (US) “College Scorecard” (the data set) includes national data on the earnings of former college graduates and new data on student debt.
First import the libraries that are going to be used:
In [2]:
# data analysis and manipulation
import numpy as np
import pandas as pd
np.set_printoptions(threshold=1000)
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
#machine learning
import tensorflow as tf
#Regular expression
import re
In [3]:
all_data = pd.read_csv('datasets/CollegeScorecardData.csv')
In [4]:
all_data.head()
Out[4]:
In [5]:
all_data.info()
There are 7703 examples and 1743 features.
There are 443 float features that may be numeric, 13 integer features that may be categorical, and 1287 features stored as strings, some of which may actually be numbers whose data was not entered correctly (for example, if there was no data for a given feature, someone could have written "blank"). Given the high number of non-numerical features, we need to explore them more. Luckily, there is a data dictionary provided with the data, so we can explore it a little bit to learn about the data. (The original file was converted to CSV.)
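As an illustration of how one could probe such a column (a sketch only; 'SOME_COLUMN' is a placeholder, not an actual column name from the data), pd.to_numeric with errors='coerce' turns entries that cannot be parsed into NaN, so comparing non-null counts before and after shows how many values are not really numbers:
# Sketch only: 'SOME_COLUMN' is a placeholder for one of the object-typed columns
col = all_data['SOME_COLUMN']
parsed = pd.to_numeric(col, errors='coerce')
print("non-null values:", col.notna().sum(), "| parseable as numbers:", parsed.notna().sum())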
In [6]:
data_dict = pd.read_csv('datasets/CollegeScorecardDataDictionary.csv')
In [7]:
data_dict.head()
Out[7]:
In [8]:
data_dict.tail()
Out[8]:
In [9]:
data_dict.info()
There are 1975 entries, but the column NAME OF DATA ELEMENT has only 1734 non-null elements, so something is up. Let's explore the dictionary a little bit more.
In [10]:
data_dict[5:10]
Out[10]:
Nothing suspicious here, let's try again.
In [11]:
data_dict[10:20]
Out[11]:
Aha! It seems that the feature at index 15 is categorical, and that's why the rows that follow it don't have a value under NAME OF DATA ELEMENT. Just for now, let's get rid of those NaN rows.
In [12]:
data_dict_no_nan_names = data_dict.dropna(subset=['NAME OF DATA ELEMENT'])
data_dict_no_nan_names[10:20]
Out[12]:
Let's get the info of the new dictionary.
In [13]:
data_dict_no_nan_names.info()
We are interested primarily in the NAME OF DATA ELEMENT, VARIABLE NAME, and API data type columns. They seem complete. Let's see how many data types there are.
In [14]:
data_dict_no_nan_names['API data type'].unique()
Out[14]:
Let's find out how many features have each data type
In [15]:
data_dict_no_nan_names['API data type'].value_counts()
Out[15]:
So in reality there are 1206 float features, 521 integer features, and 7 string features. (For now we assume that the autocomplete type is string.) These numbers differ a lot from our previous analysis, in which we had 443 float features, 13 integer features, and 1287 string features.
Also, we cannot assume that all features of type integer are categorical; for example, the ZIP code feature is an integer but is not categorical.
Let's find more about the autocomplete features:
In [16]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'autocomplete']
Out[16]:
We can see that these autocomplete features can be treated as strings.
In [17]:
# Drop the columns in which every value is NaN
all_data_no_na_columns = all_data.dropna(axis=1, how='all')
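To see how many columns this removed, comparing the shapes before and after is enough (a quick check that is not part of the original notebook):
# Number of columns that were entirely NaN and therefore dropped
print(all_data.shape[1] - all_data_no_na_columns.shape[1])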
There are features that are meaningless for the problem we are trying to solve. We need to drop these features, but we need a criterion to eliminate them. The criterion we are going to employ is to eliminate features that are unique for every entry and don't add information to the problem; for example, a unique ID for every institution doesn't add information to the problem.
Also, we need to take into account that there are features that may be unique for every entry but DO add relevant information. For example, the tuition fees may be unique and still add information.
Let's find the ratio of the number of unique values over number of examples:
In [18]:
# Create a list to save the features that are above a certain threshold
features_with_high_ratio = []
# Create a list to save the features in all_data but not in the dict
features_not_in_dict = []
# Calculate the ratio for each non-float feature
for feature in all_data_no_na_columns.columns.values:
    # Get the row in the dict which has VARIABLE NAME == feature
    row_in_dict = data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
    # Get the data type of the row
    data_type_series = row_in_dict['API data type']
    # Check if the feature exists in the dict
    if data_type_series.size > 0:
        # Get the data type
        data_type = data_type_series.values[0]
        # Float features (numeric features) are not taken into account
        if data_type == 'integer' or data_type == 'string' or data_type == 'autocomplete':
            column = all_data_no_na_columns[feature]
            column_no_na = column.dropna()
            r = column_no_na.unique().size / column_no_na.size
            if r > 0.8:
                features_with_high_ratio.append(feature)
                print(str(feature) + ": " + str(r))
    # The feature is not in the dict
    else:
        features_not_in_dict.append(feature)
print("\nFeatures in data but not in the dictionary: " + str(features_not_in_dict))
So there are some features in the data that are not explained in the dictionary. That is not necessarily an inconvenience, so we won't worry about this right now.
Let's find out what those NPT4 features are about.
In [19]:
npt4_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT4_PUB'
npt41_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT41_PUB'
npt42_pub = data_dict_no_nan_names['VARIABLE NAME'] == 'NPT42_PUB'
data_dict_no_nan_names[npt4_pub | npt41_pub | npt42_pub ]
Out[19]:
So those NPT4 features are about average net prices, so they are definitely numeric features, and it makes sense to keep them.
Let's run our previous analysis again without those features so we can have a cleaner view as we lower the threshold.
In [20]:
# Create a list to save the features that are above a certain threshold
features_with_high_ratio = []
# Create a list to save the features in all_data but not in the dict
features_not_in_dict = []
# Calculate the ratio, excluding the NPT4 features
for feature in all_data_no_na_columns.columns.values:
    # Get the row in the dict which has VARIABLE NAME == feature
    row_in_dict = data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
    # Get the data type of the row
    data_type_series = row_in_dict['API data type']
    # Check if the feature exists in the dict
    if data_type_series.size > 0:
        # Get the data type
        data_type = data_type_series.values[0]
        # Float features (numeric features) are not taken into account
        if (data_type == 'integer' or data_type == 'string' or data_type == 'autocomplete') \
                and feature[:4] != 'NPT4':
            column = all_data_no_na_columns[feature]
            column_no_na = column.dropna()
            r = column_no_na.unique().size / column_no_na.size
            if r > 0.5:
                features_with_high_ratio.append(feature)
                print(str(feature) + ": " + str(r))
print(features_with_high_ratio)
Let's see what these features are about:
In [21]:
# Collect the dictionary rows for the high-ratio features
# (pd.concat is used here instead of the deprecated DataFrame.append)
high_ratio_features = pd.concat(
    [data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == feature]
     for feature in features_with_high_ratio]
)
high_ratio_features
Out[21]:
So UNITID, OPEID, OPEID6, INSTNM, INSTURL, NPCURL, and ALIAS are features that have to do with the identity of the institution, so they don't add relevant information to the problem; therefore they will be eliminated.
The ZIP code could be useful if it were used to group the schools into some sort of category based on their location. We are not going to do this, so we are going to eliminate it as well.
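For reference, a minimal sketch of the kind of grouping that could have been done instead (an assumption, not applied in this notebook): coarsening ZIP codes to their first digit, which roughly corresponds to a US geographic region.
# Illustrative only, not run here: coarsen ZIP codes to their first digit
zip_region = all_data_no_na_columns['ZIP'].astype(str).str[:1]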
In [22]:
all_data_no_id_cols = all_data_no_na_columns.drop(['UNITID', 'OPEID', 'OPEID6', 'INSTNM', 'INSTURL', 'NPCURL', 'ALIAS', 'ZIP'], axis = 1)
In [23]:
all_data_no_id_cols.head()
Out[23]:
In [24]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'string']
Out[24]:
We already dropped INSTURL and NPCURL. Let's explore the STABBR feature
In [25]:
all_data_no_id_cols['STABBR']
Out[25]:
So this feature has to do with the state where the school is located. Let's explore the ACCREDAGENCY feature:
In [26]:
all_data_no_id_cols['ACCREDAGENCY']
Out[26]:
In [27]:
all_data_no_id_cols['ACCREDAGENCY'].value_counts()
Out[27]:
Now let's explore the autocomplete data type:
In [28]:
data_dict_no_nan_names[data_dict_no_nan_names['API data type'] == 'autocomplete']
Out[28]:
INSTNM and ALIAS were already dropped; let's see the CITY feature:
In [29]:
all_data_no_id_cols['CITY']
Out[29]:
So STABBR, ACCREDAGENCY, and CITY are features that we are going to keep, but they need to be transformed into a numeric (integer-coded) representation, since ML algorithms work with numbers and not strings.
In [30]:
all_data_no_strings = all_data_no_id_cols.copy()
# STABBR mapping
values = all_data_no_strings['STABBR'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['STABBR'] = all_data_no_strings['STABBR'].map(mapping)
# ACCREDAGENCY mapping
values = all_data_no_id_cols['ACCREDAGENCY'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['ACCREDAGENCY'] = all_data_no_strings['ACCREDAGENCY'].map(mapping)
# CITY mapping
values = all_data_no_id_cols['CITY'].unique()
mapping = {}
numeric_value = 1
for value in values:
    mapping[value] = numeric_value
    numeric_value += 1
all_data_no_strings['CITY'] = all_data_no_strings['CITY'].map(mapping)
all_data_no_strings.head()
Out[30]:
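As a side note, pandas can build the same kind of integer codes more concisely with pd.factorize (a sketch, not what is used above; the codes start at 0 instead of 1 and NaN is encoded as -1):
# Equivalent idea using pd.factorize (codes start at 0; NaN becomes -1)
codes, uniques = pd.factorize(all_data_no_id_cols['STABBR'])
all_data_no_strings_alt = all_data_no_id_cols.copy()
all_data_no_strings_alt['STABBR'] = codes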
Let's see how our data looks so far
In [31]:
all_data_no_strings.info()
Although we mapped or eliminated the string features, we still have a lot of object (non-numeric) columns. Let's work on them.
In [32]:
# Collect the values in object columns that don't look like plain numbers
regex = re.compile(r'[0-9]+(\.[0-9]+)?$')
words = []
for column in all_data_no_strings:
    if all_data_no_strings[column].dtypes == 'object':
        for data in all_data_no_strings[column]:
            if not regex.match(str(data)):
                words.append(data)
In [33]:
pd.Series(words).value_counts()
Out[33]:
We can see that there is a lot of data suppressed for privacy reasons. Also, there are dates, and one of them, 12/31/2999, seems to be invalid. Let's go ahead and replace these values with NaN, so they are treated like any other NaN value. Also, if any column ends up having all of its values as NaN, we will delete that column.
In [34]:
all_data_replaced_with_nan = all_data_no_strings.replace(to_replace = 'PrivacySuppressed', value = np.nan)
all_data_replaced_with_nan = all_data_replaced_with_nan.replace(to_replace = '12/31/2999', value = np.nan)
all_data_replaced_with_nan = all_data_replaced_with_nan.dropna(axis=1, how='all')
In [35]:
all_data_replaced_with_nan.info()
Let's find which features are date features.
In [36]:
features_with_date = []
for column in all_data_replaced_with_nan:
    if all_data_replaced_with_nan[column].dtypes == 'object':
        if all_data_replaced_with_nan[column].str.match('[0-9]{2}/[0-9]{2}/[0-9]{4}').any():
            features_with_date.append(column)
In [37]:
features_with_date
Out[37]:
In [38]:
data_dict_no_nan_names[data_dict_no_nan_names['VARIABLE NAME'] == 'SEPAR_DT_MDN']
Out[38]:
It seems that SEPAR_DT_MDN doesn't add valuable information to the problem, so we are going to drop it.
In [39]:
all_data_no_dates = all_data_replaced_with_nan.drop(['SEPAR_DT_MDN'], axis = 1)
Now we will transform all the object features to numeric.
In [40]:
all_data_no_objects = all_data_no_dates.copy()
for feature in all_data_no_dates:
    if all_data_no_dates[feature].dtypes == 'object':
        # Make all data numeric
        all_data_no_objects[feature] = pd.to_numeric(all_data_no_dates[feature])
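If any stray non-numeric strings had survived the earlier cleanup, pd.to_numeric would raise a ValueError on them. A more defensive variant (an assumption about how one might harden this step, not what the notebook runs) is to pass errors='coerce' inside the loop so unparseable values become NaN:
        # Hypothetical defensive variant of the line above: unparseable values become NaN
        all_data_no_objects[feature] = pd.to_numeric(all_data_no_dates[feature], errors='coerce')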
In [41]:
all_data_no_objects.info()
Now we have gotten rid of the object dtype. Next, let's find the features in which more than 90% of the values are missing; they carry too little information to be imputed reliably, so we will drop them.
In [42]:
# Find the features in which more than 90% of the values are missing
high_nan_features = []
for feature in all_data_no_objects:
    size = all_data_no_objects[feature].size
    number_of_valid = all_data_no_objects[feature].count()
    number_of_nan = size - number_of_valid
    ratio = number_of_nan / size
    if ratio > 0.9:
        high_nan_features.append(feature)
print(len(high_nan_features))
In [43]:
all_data_no_high_nan = all_data_no_objects.drop(high_nan_features, axis = 1)
In [44]:
all_data_no_high_nan.info()
In [45]:
data_dict[15:25]
Out[45]:
We can see that after the name of a categorical feature there is at least one row with NaN under NAME OF DATA ELEMENT (the rows listing its possible values). Let's use this to get a list of categorical features.
In [46]:
categorical_features = []
is_null = data_dict['NAME OF DATA ELEMENT'].isnull()
for i in range(len(is_null) - 1):
    if not is_null[i] and is_null[i + 1]:
        categorical_features.append(data_dict['VARIABLE NAME'][i])
To fill the missing data that belongs to a categorical feature, we will use the most common value of the data (mode). To fill the missing data that belongs to a numeric feature, we will use the average of the data (mean).
In [47]:
all_data_no_nan = all_data_no_high_nan.copy()
for feature in all_data_no_high_nan:
    if feature in categorical_features:
        mode = all_data_no_high_nan[feature].mode()[0]
        all_data_no_nan[feature] = all_data_no_high_nan[feature].fillna(mode)
    else:
        mean = all_data_no_high_nan[feature].mean()
        all_data_no_nan[feature] = all_data_no_high_nan[feature].fillna(mean)
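As an aside, scikit-learn's SimpleImputer offers an equivalent, pipeline-friendly way to do this kind of imputation. The sketch below assumes the same categorical/numeric split as above and is not what the notebook actually runs:
from sklearn.impute import SimpleImputer

# Hypothetical alternative: mode for the categorical columns, mean for the rest
cat_cols = [c for c in all_data_no_high_nan.columns if c in categorical_features]
num_cols = [c for c in all_data_no_high_nan.columns if c not in categorical_features]
all_data_imputed = all_data_no_high_nan.copy()
all_data_imputed[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(all_data_no_high_nan[cat_cols])
all_data_imputed[num_cols] = SimpleImputer(strategy='mean').fit_transform(all_data_no_high_nan[num_cols])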
In [48]:
all_data_no_nan.head()
Out[48]:
In [49]:
all_data_no_nan.info()
Let's save the data in a file
In [50]:
all_data_no_nan.to_csv('datasets/CollegeScorecardDataCleaned.csv', index = False)